Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip single file in a partition from OPTIMIZE #23864

Merged
merged 5 commits into from
Oct 25, 2024

Conversation

raunaqmorarka
Copy link
Member

Description

Single file without a delete in a partition can't be optimized any further.
Avoiding rewriting such files improves the performance of OPTIMIZE on
large partitioned tables

Additional context and related issues

Fixes #10785

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Iceberg
* Improve performance of OPTIMIZE on large partitioned tables. ({issue}`10785`)

Copy link
Contributor

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only cosmetics spotted.
I appreciate seeing the testing coverage for the newly added functionality.

List<Object> splitsInfo = ImmutableList.copyOf(scannedFiles.build());
log.info("Generated %d splits, skipped %d files for OPTIMIZE", splitsInfo.size(), filesSkipped);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not debug?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful enough to be at info, until we enhance optimize to print some useful summary at the end of execution. We get a bunch of useless logs out of iceberg library anyway in IcebergSplitSource, this isn't adding much to the noise.

private ImmutableSet.Builder<DataFileWithDeleteFiles> scannedFiles = ImmutableSet.builder();
@Nullable
private Map<StructLikeWrapperWithFieldIdToIndex, Optional<FileScanTaskWithDomain>> scannedFilesByPartition = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that there is no need to log how many files were skipped.
Consider using a local variable in the method instead of adding more state to the class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This state is required regardless of the log line. We need to maintain scannedFilesByPartition for the entire lifetime of the split source as file tasks iteration is not guaranteed to be partition at a time.
Having the log line is useful for knowing what happened. We should probably extend OPTIMIZE to print some useful stats like this at the end of the execution.

@@ -304,6 +277,54 @@ private CompletableFuture<ConnectorSplitBatch> getNextBatchInternal(int maxSize)
return completedFuture(new ConnectorSplitBatch(splits, isFinished()));
}

private Iterator<FileScanTaskWithDomain> prepareFileTasksIterator(List<FileScanTaskWithDomain> fileScanTasks)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are materializing FileScanTaskWithDomain so just return LIst< FileScanTaskWithDomain> here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more convenient to return iterator as the rest of the existing code is written to work on an iterator

import java.util.Objects;
import java.util.stream.IntStream;

public class StructLikeWrapperWithFieldIdToIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice class name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

credit to @homar ;)

Copy link
Member

@losipiuk losipiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit hard to read - but logic seems fine

Keeping domain attached to the relevant FileScanTask is safer than
handling it as member variable of the class
This will be used in subsequent commit to skip unncessary files from optimize
Single file without a delete in a partition can't be
optimized any further
@raunaqmorarka raunaqmorarka merged commit 75fa7cb into trinodb:master Oct 25, 2024
45 checks passed
@raunaqmorarka raunaqmorarka deleted the iceberg-opt-skip branch October 25, 2024 15:22
@github-actions github-actions bot added this to the 464 milestone Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Skip OPTIMIZE in Iceberg when table/partition has only one file
3 participants